2016-06-24

About me

  • Maintainer of plotly's R package for >12 months. Author/maintainer of many other R packages: LDAvis, animint, pitchRx, rdom, etc.
  • PhD candidate in the Department of Statistics at Iowa State University.
  • Spent the last 6 months in Melbourne, Australia working with Dr. Dianne Cook, Dr. Heike Hofmann, and Dr. Rob Hyndman.
  • BA in Math & Econ, but don't ask me any Econ questions 😉

Workshop materials

Please install the following packages:

install.packages("quantmod")
install.packages("RColorBrewer")
install.packages("devtools")
# if you encounter errors, let me know!
devtools::install_github("ropensci/plotly#628")

Code for today's workshop is here. Please download it and follow along.

At times, I'll give you an exercise and stop talking to encourage hands-on learning.

Please ask questions!!!

Why interactive graphics?

  • "Any high-dimensional dataviz has to be summarized in some way, but interactivity allows us to get details" - (Dr. Karl Broman; JSM 2015)
  • Aids exploration – discover structure that may otherwise go unnoticed (see Wickham, Cook, Hofmann; 2015)
  • Aids communication & enhances presentability

  • Why interactive web graphics?
    • simple to share, portable (web browser)
    • encourages composability (e.g., plotly.js + leaflet.js)
    • encourages reproducibility (scriptable)

Before we start, some R basics

  • Combine values into a vector with c()
c(0, 1)
#> [1] 0 1
  • Assign values to a name with <- (or =)
x <- c(0, 1)
  • Avoid for loops and use built-in vectorized functions
sum(x + 10)
#> [1] 21
  • R can be arcane at times, but it has rich built-in documentation; see ?sum
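
To make the vectorization point concrete, here is a minimal sketch that computes the same total with an explicit loop and with vectorized arithmetic. The names total and vectorized_total are only for illustration.

```r
x <- c(0, 1)

# Loop version: add 10 to each element and accumulate the total
total <- 0
for (xi in x) {
  total <- total + (xi + 10)
}

# Vectorized version: `x + 10` operates on the whole vector at once
vectorized_total <- sum(x + 10)

identical(total, vectorized_total)
#> [1] TRUE
```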

Extract named elements with $, [[, and/or [

x <- list(a = 1:5, b = "red")
x$a
#> [1] 1 2 3 4 5
x[["a"]]
#> [1] 1 2 3 4 5
x["a"]
#> $a
#> [1] 1 2 3 4 5

A data frame is essentially a list of equal-length columns:

range(mtcars$mpg)
#> [1] 10.4 33.9
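
A quick way to check the list claim, using only base R and the built-in mtcars data, is to apply the list extraction operators to a data frame:

```r
# A data frame is stored as a list with one element per column
is.list(mtcars)
#> [1] TRUE

# So $ and [[ return the same column vector
identical(mtcars$mpg, mtcars[["mpg"]])
#> [1] TRUE

# Single brackets keep the container: the result is still a data frame
class(mtcars["mpg"])
#> [1] "data.frame"
```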

ggplotly

plotly can convert most ggplot2 plots; see http://ropensci.github.io/plotly/ggplot2

library(plotly)
p <- qplot(data = mpg, displ, hwy, geom = c("point", "smooth")) + facet_wrap(~drv)
ggplotly(p)

That's great, but…

  • ggplot2's interface wasn't designed for interactive graphics.
  • ggplot2 requires data frame(s) and can be inefficient (especially for time series).
  • plotly.js creates visualizations that ggplot2 simply can't.
  • A recent update to plot_ly() combines the flexibility of plotly.js with the elegance of a layered grammar of graphics.

  • To begin, suppose we have a numeric matrix:
str(volcano)
#>  num [1:87, 1:61] 100 101 102 103 104 105 105 106 107 108 ...
  • Side note: a formula captures the environment in which it was created, so the objects it references can be looked up later.
str(~volcano)
#> Class 'formula'  language ~volcano
#>   ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
f <- function() { str(~volcano) }
f()
#> Class 'formula'  language ~volcano
#>   ..- attr(*, ".Environment")=<environment: 0x7fbf1cb65aa8>
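
The captured environment is what lets a formula be evaluated later, somewhere else. The sketch below uses a hypothetical function make_formula() to show that evaluating the formula's right-hand side finds the local value rather than the global one. That plot_ly() resolves ~volcano in a similar way is an assumption here, not something shown on the slides.

```r
make_formula <- function() {
  z <- "local value"   # exists only inside this function call
  ~z                   # the formula remembers where it was created
}

z <- "global value"
fm <- make_formula()

# Evaluate the right-hand side of the formula in its captured environment
eval(fm[[2]], environment(fm))
#> [1] "local value"
```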

No data frame necessary!

plot_ly(z = ~volcano)
#> No trace type specified. Applying `add_heatmap()`.
#> Read more about this trace type here -> https://plot.ly/r/reference/#heatmap

Functional interface

Most functions take a plotly object as input and return a modified plotly object.

# equivalent to before, but more explicit
add_heatmap(plot_ly(z = ~volcano))

Use the %>% operator to enhance readability, for example:

plot_ly(z = ~volcano) %>% add_surface() %>% plotly_POST()

is easier to read than:

plotly_POST(add_surface(plot_ly(z = ~volcano)))
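
The same equivalence holds for any functions, not just plotly's. A minimal sketch with base R functions, assuming the magrittr package is installed (plotly and dplyr re-export its %>% operator):

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9)

# Nested calls read inside-out
nested <- round(mean(sqrt(x)), 1)

# Piped calls read left to right: each result becomes the
# first argument of the next call
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
#> [1] TRUE
```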

Housing sales in Texas

library(plotly)
txhousing
#> Source: local data frame [8,602 x 9]
#> 
#>       city  year month sales   volume median listings inventory     date
#>      <chr> <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl>    <dbl>
#> 1  Abilene  2000     1    72  5380000  71400      701       6.3 2000.000
#> 2  Abilene  2000     2    98  6505000  58700      746       6.6 2000.083
#> 3  Abilene  2000     3   130  9285000  58100      784       6.8 2000.167
#> 4  Abilene  2000     4    98  9730000  68600      785       6.9 2000.250
#> 5  Abilene  2000     5   141 10590000  67300      794       6.8 2000.333
#> 6  Abilene  2000     6   156 13910000  66900      780       6.6 2000.417
#> 7  Abilene  2000     7   152 12635000  73500      742       6.2 2000.500
#> 8  Abilene  2000     8   131 10710000  75000      765       6.4 2000.583
#> 9  Abilene  2000     9   104  7615000  64500      771       6.5 2000.667
#> 10 Abilene  2000    10   101  7040000  59300      764       6.6 2000.750
#> ..     ...   ...   ...   ...      ...    ...      ...       ...      ...

Univariate plots

plot_ly(txhousing, x = ~sales)
#> No trace type specified. Applying `add_histogram()`.
#> Read more about this trace type here -> https://plot.ly/r/reference/#histogram

Your Turn 1

  • Easy: Make histograms of other numeric variables.
  • Medium: Make a histogram of log(sales) and sqrt(sales)
  • Hard: Make a density plot of log sales (Hint: use density() and add_area()).
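
For the hard exercise, it may help to see what density() returns. This sketch uses mtcars rather than txhousing so it does not give the solution away:

```r
# density() returns an object whose x component holds the evaluation
# grid and whose y component holds the estimated density at each point
dens <- density(log(mtcars$mpg))
class(dens)
#> [1] "density"

# Both are plain numeric vectors of equal length, so they can be
# used directly as plot coordinates
is.numeric(dens$x) && is.numeric(dens$y)
#> [1] TRUE
length(dens$x) == length(dens$y)
#> [1] TRUE
```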

See here for solutions.

Mean sales by city

library(dplyr)
d <- txhousing %>% 
  group_by(city) %>%
  summarise(m = mean(sales, na.rm = TRUE))
d
#> Source: local data frame [46 x 2]
#> 
#>                     city          m
#>                    <chr>      <dbl>
#> 1                Abilene  150.48663
#> 2               Amarillo  238.65241
#> 3              Arlington  423.98396
#> 4                 Austin 1996.68984
#> 5               Bay Area  502.61497
#> 6               Beaumont  177.05882
#> 7        Brazoria County   94.23121
#> 8            Brownsville   65.51892
#> 9  Bryan-College Station  186.74332
#> 10         Collin County 1084.98396
#> ..                   ...        ...

split-apply-combine

split-apply-combine-visualize

txhousing %>% 
  group_by(city) %>%
  summarise(m = mean(sales, na.rm = TRUE)) %>%
  arrange(m) %>%
  plot_ly(x = ~m, y = ~city) %>%
  layout(title = "Average monthly sales, by city", margin = list(l = 120))
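
The three steps that group_by() and summarise() perform can be spelled out in base R. This is only an illustrative sketch on the built-in mtcars data, not how dplyr is implemented:

```r
# Split: one vector of mpg values per number of cylinders
pieces <- split(mtcars$mpg, mtcars$cyl)

# Apply: summarise each piece independently
means <- sapply(pieces, mean)

# Combine: collect the results into one data frame
result <- data.frame(cyl = names(means), m = unname(means))
nrow(result)
#> [1] 3
```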

Monthly sales by city

p <- txhousing %>%
  group_by(city) %>%
  plot_ly(x = ~date, y = ~sales, text = ~city) %>%
  layout(hovermode = "closest")
p

Log of monthly sales

add_lines(p, y = ~log(sales))

Highlight city on hover

Median house price

Top 5 cities wrt mean sales

top5 <- txhousing %>% 
  group_by(city) %>%
  summarise(m = mean(sales, na.rm = TRUE)) %>%
  arrange(desc(m)) %>%
  top_n(5, m)
top5
#> Source: local data frame [5 x 2]
#> 
#>            city        m
#>           <chr>    <dbl>
#> 1       Houston 5582.882
#> 2        Dallas 4364.289
#> 3        Austin 1996.690
#> 4   San Antonio 1722.444
#> 5 Collin County 1084.984

Sales in top 5 cities

top5plot <- txhousing %>%
  semi_join(top5, by = "city") %>%
  plot_ly(x = ~date, y = ~sales, color = ~city)
add_lines(top5plot)
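
semi_join() acts as a filter here: it keeps the rows of the left table that have a match in the right table, without adding any columns. A small sketch on made-up data (the data frames sales and keep are hypothetical):

```r
library(dplyr)

sales <- data.frame(
  city  = c("A", "A", "B", "C"),
  sales = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)
keep <- data.frame(
  city = c("A", "C"),
  m    = c(15, 40),
  stringsAsFactors = FALSE
)

# Rows of `sales` whose city appears in `keep`; the `m` column is not added
filtered <- semi_join(sales, keep, by = "city")
names(filtered)
#> [1] "city"  "sales"
filtered$sales
#> [1] 10 20 40
```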

ColorBrewer palettes

RColorBrewer::display.brewer.all(type = "qual")

Provide the palette name

add_lines(top5plot, colors = "Dark2")

Your Turn 2

Similar to color/colors, there are also linetype/linetypes and symbol/symbols variable mappings. Use them to plot sales (or some other variable) for the top 5 cities.

See here for solutions.

Split-apply-plot?

txhousing %>%
  semi_join(top5) %>%
  group_by(city) %>%
  do(p = plot_ly(., x = ~log(sales), name = .$city))
#> Source: local data frame [5 x 2]
#> Groups: <by row>
#> 
#>            city                       p
#>           (chr)                   (chr)
#> 1        Austin <S3:plotly, htmlwidget>
#> 2 Collin County <S3:plotly, htmlwidget>
#> 3        Dallas <S3:plotly, htmlwidget>
#> 4       Houston <S3:plotly, htmlwidget>
#> 5   San Antonio <S3:plotly, htmlwidget>

txhousing %>%
  semi_join(top5) %>%
  group_by(city) %>%
  do(p = plot_ly(., x = ~log(sales), name = ~city)) %>%
  .[["p"]] %>% 
  subplot(nrows = 5, shareX = TRUE)

Economics data

economics 
#> Source: local data frame [574 x 6]
#> 
#>          date   pce    pop psavert uempmed unemploy
#>        <date> <dbl>  <int>   <dbl>   <dbl>    <int>
#> 1  1967-07-01 507.4 198712    12.5     4.5     2944
#> 2  1967-08-01 510.5 198911    12.5     4.7     2945
#> 3  1967-09-01 516.3 199113    11.7     4.6     2958
#> 4  1967-10-01 512.9 199311    12.5     4.9     3143
#> 5  1967-11-01 518.1 199498    12.5     4.7     3066
#> 6  1967-12-01 525.8 199657    12.1     4.8     3018
#> 7  1968-01-01 531.5 199808    11.7     5.1     2878
#> 8  1968-02-01 534.2 199920    12.2     4.5     3001
#> 9  1968-03-01 544.9 200056    11.6     4.1     2877
#> 10 1968-04-01 544.6 200208    12.2     4.6     2709
#> ..        ...   ...    ...     ...     ...      ...

Wide to long data

library(tidyr)
gather(economics, variable, value, -date)
#> Source: local data frame [2,870 x 3]
#> 
#>          date variable value
#>        <date>    <chr> <dbl>
#> 1  1967-07-01      pce 507.4
#> 2  1967-08-01      pce 510.5
#> 3  1967-09-01      pce 516.3
#> 4  1967-10-01      pce 512.9
#> 5  1967-11-01      pce 518.1
#> 6  1967-12-01      pce 525.8
#> 7  1968-01-01      pce 531.5
#> 8  1968-02-01      pce 534.2
#> 9  1968-03-01      pce 544.9
#> 10 1968-04-01      pce 544.6
#> ..        ...      ...   ...
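
On a toy data frame the reshaping is easier to follow. The sketch below (with a made-up data frame named wide) also shows where the row count comes from: the 574 rows of economics times its 5 non-date columns gives the 2,870 rows above.

```r
library(tidyr)

wide <- data.frame(
  date = as.Date(c("2016-01-01", "2016-02-01")),
  a = c(1, 2),
  b = c(3, 4)
)

# Every column except `date` is stacked into key/value pairs
long <- gather(wide, variable, value, -date)
long$variable
#> [1] "a" "a" "b" "b"
long$value
#> [1] 1 2 3 4

# Row count: rows of `wide` times the number of gathered columns
nrow(long) == nrow(wide) * (ncol(wide) - 1)
#> [1] TRUE
```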

gather -> split-apply-combine

economics %>%
  gather(variable, value, -date) %>%
  group_by(variable) %>%
  summarise(m = min(value))
#> Source: local data frame [5 x 2]
#> 
#>   variable        m
#>      <chr>    <dbl>
#> 1      pce    507.4
#> 2      pop 198712.0
#> 3  psavert      1.9
#> 4  uempmed      4.0
#> 5 unemploy   2685.0

economics %>%
  gather(variable, value, -date) %>%
  group_by(variable) %>%
  do(p = plot_ly(., x = ~date, y = ~value, name = ~variable)) %>%
  .[["p"]] %>%
  subplot(nrows = NROW(.), shareX = TRUE, titleX = FALSE)

Unemployment rate vs duration

economics %>%
  plot_ly(x = ~uempmed, y = ~unemploy/pop) %>%
  add_markers(color = ~as.numeric(date), text = ~date, hoverinfo = "text")

Your Turn 3

Get creative and find something interesting in the economics data!

Stock prices

library(quantmod)
msft <- getSymbols("MSFT", auto.assign = FALSE)
dat <- as.data.frame(msft)
dat$date <- index(msft)
head(dat)
#>            MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume
#> 2007-01-03     29.91     30.25    29.40      29.86    76935100
#> 2007-01-04     29.70     29.97    29.44      29.81    45774500
#> 2007-01-05     29.63     29.75    29.45      29.64    44607200
#> 2007-01-08     29.65     30.10    29.53      29.93    50220200
#> 2007-01-09     30.00     30.18    29.73      29.96    44636600
#> 2007-01-10     29.80     29.89    29.43      29.66    55017400
#>            MSFT.Adjusted       date
#> 2007-01-03      23.78435 2007-01-03
#> 2007-01-04      23.74452 2007-01-04
#> 2007-01-05      23.60911 2007-01-05
#> 2007-01-08      23.84011 2007-01-08
#> 2007-01-09      23.86400 2007-01-09
#> 2007-01-10      23.62504 2007-01-10

Candlestick chart

names(dat) <- sub("^MSFT\\.", "", names(dat))

plot_ly(dat, x = ~date, xend = ~date, color = ~Close > Open, 
        colors = c("red", "forestgreen"), hoverinfo = "none") %>%
  add_segments(y = ~Low, yend = ~High, line = list(width = 1)) %>%
  add_segments(y = ~Open, yend = ~Close, line = list(width = 3)) %>%
  layout(showlegend = FALSE, yaxis = list(title = "Price")) %>%
  toWebGL()

See here for the plot.
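
The renaming step at the top of the candlestick code strips the ticker prefix so the columns can be referenced as ~Open, ~Close, and so on. A standalone sketch of what that regular expression does (the vector old is illustrative):

```r
old <- c("MSFT.Open", "MSFT.Close", "MSFT.Volume", "date")

# ^ anchors the match at the start of the string and \\. matches a
# literal dot, so names without the prefix are left untouched
new <- sub("^MSFT\\.", "", old)
new
#> [1] "Open"   "Close"  "Volume" "date"
```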

Accessing plotly events in shiny

library(shiny)
ui <- fluidPage(
  plotlyOutput("plot"),
  verbatimTextOutput("click"),
  verbatimTextOutput("brush")
)
server <- function(input, output, session) {
  output$plot <- renderPlotly({
    mtcars %>%
      plot_ly(x = ~mpg, y = ~wt, key = row.names(mtcars)) %>%
      add_markers() %>%
      layout(dragmode = "select")
  })
  output$click <- renderPrint({
    event_data("plotly_click")
  })
  output$brush <- renderPrint({
    event_data("plotly_selected")
  })
}
shinyApp(ui, server)
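
Because the plot sets key to the row names of mtcars, an event can be mapped back to the underlying rows. The sketch below assumes that event_data() returns NULL before any interaction and otherwise a data frame with a key column; rows_from_event() and mock_event are hypothetical and stand in for a live shiny session.

```r
# Hypothetical helper: map an event (shaped as assumed above) back to
# rows of mtcars
rows_from_event <- function(event) {
  if (is.null(event)) return(mtcars[0, ])  # nothing clicked or selected yet
  mtcars[as.character(event$key), , drop = FALSE]
}

# Mock of what a selection of two points might look like
mock_event <- data.frame(
  curveNumber = c(0, 0),
  pointNumber = c(0, 3),
  x = c(21.0, 21.4),
  y = c(2.620, 3.215),
  key = c("Mazda RX4", "Hornet 4 Drive"),
  stringsAsFactors = FALSE
)

rownames(rows_from_event(mock_event))
#> [1] "Mazda RX4"      "Hornet 4 Drive"
```

Inside the server function, a call such as rows_from_event(event_data("plotly_selected")) could then feed a table output.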

Linked correlation matrix

Thank you!